Part 1

1.a) Describing the Dugong’s data

\[~\]

We start by loading the data.

dugongdat <- read.delim2("./dugong-data.txt", header = TRUE, sep = "", dec = ",", stringsAsFactors = FALSE)
dugongdat <- dugongdat[, -c(1)]
head(dugongdat)
##   Length Age
## 1   1.80 1.0
## 2   1.85 1.5
## 3   1.87 1.5
## 4   1.77 1.5
## 5   2.02 2.5
## 6   2.27 4.0

\[~\]

Now, we save the column of lengths as \(y_{obs}\) and the column of ages as \(x_{i}\). After that, we can show some characteristics of these two variables:
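The extraction itself is not shown; presumably something like the following, using the column names from the `head()` output and illustrated here on the first rows printed above:

```r
# first rows of the data, as printed above (illustrative subset)
dugongdat <- data.frame(Length = c(1.80, 1.85, 1.87, 1.77, 2.02, 2.27),
                        Age    = c(1.0, 1.5, 1.5, 1.5, 2.5, 4.0))
y_obs <- dugongdat$Length  # response: lengths
x_i   <- dugongdat$Age     # covariate: ages
```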

\[~\]

cat(paste("The variance of the observations is: ", round(var(y_obs), 3), "\n", sep =""))
## The variance of the observations is: 0.075
cat("the other features are:\n")
## the other features are:
summary(y_obs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.770   2.225   2.400   2.334   2.530   2.720
hist(y_obs, col="orchid2", main="Distribution of the length y_obs", xlab="Length")
rug(y_obs, lwd=2)

\[~\]

cat(paste("The variance of the ages is: ", round(var(x_i), 3), "\n", sep =""))
## The variance of the ages is: 61.929
cat("the other features are:\n")
## the other features are:
summary(x_i)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    9.50   10.94   15.00   31.50
hist(x_i, col="orchid2", main="Distribution of the age x_i", xlab="Age")
rug(x_i, lwd=2)

Analyzing the information together

\[~\]

As we can see, in this homework the statistical model is a non-linear regression; for this reason it is interesting to consider these two variables together:

As the plot above suggests, the statistical model should indeed be a non-linear regression; moreover, the correlation between the two variables is \(\sim \textit{0.8296}\) and their covariance is \(\sim \textit{1.7932}\).

\[~\]

1.b) Deriving the likelihood function

\[~\]

In order to establish the likelihood function, we assume that the \(Y_{i}\) are i.i.d. Considering that \(Y_{i} \sim N(\mu_{i}, \tau^{2})\), we define the likelihood function in this way:

\[ L_{y_{obs}}(\alpha, \beta, \gamma, x_{i}, \tau^{2}) = \prod_{i=1}^{n} f(y_i|\alpha, \beta, \gamma, x_{i}, \tau^{2}) = \\ = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\tau^{2}}} \cdot exp\Big\{-\frac{1}{2\tau^{2}} ( y_{i} - \mu_{i})^{2} \Big\} = \\ = (2\pi\tau^{2})^{-\frac{n}{2}} \cdot exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} \Big\} \]
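As a quick sanity check, the likelihood above can be coded directly on the log scale; a minimal sketch (the function name `loglik_dugong` is ours):

```r
# log-likelihood of the model Y_i ~ N(alpha - beta * gamma^x_i, tau2)
loglik_dugong <- function(y, x, alpha, beta, gamma, tau2) {
  mu <- alpha - beta * gamma^x  # non-linear mean function
  sum(dnorm(y, mean = mu, sd = sqrt(tau2), log = TRUE))
}
```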

\[~\]

1.c) Joint prior distribution of the parameters

\[~\]

We now write down the joint prior distribution of the parameters \(\alpha, \beta, \gamma \,\, \text{and} \,\, \tau^{2}\). It is important to note that, since the priors are assumed independent, the joint prior is written as the product of the individual priors:

\[ \pi(\alpha, \beta, \gamma, \tau^{2}) = f_{Normal(0, \sigma_{\alpha}^{2})}(\alpha) \cdot f_{Normal(0, \sigma_{\beta}^{2})}(\beta) \cdot f_{Unif(0, 1)}(\gamma) \cdot f_{InvGamma(a, b)}(\tau^{2}) \propto \\ \propto exp\Big\{ -\frac{\alpha^{2}}{2\sigma_{\alpha}^{2}} \Big\} \cdot \mathbb{I_{(1,\infty)}(\alpha)} \cdot exp\Big\{ -\frac{\beta^{2}}{2\sigma_{\beta}^{2}} \Big\} \cdot \mathbb{I_{(1,\infty)}}(\beta) \cdot \frac{exp\Big\{ -\frac{b}{\tau^{2}} \Big\}}{\tau^{2(a+1)}} \cdot \mathbb{I_{(0,\infty)}}(\tau^2) \cdot \frac{1}{1-0} \cdot \mathbb{I}_{[0, 1]}(\gamma) \]

In this step we choose suitable hyperparameter values for this statistical model, in order to match the main features of the data recognized before.

\[~\]

Choosing a suitable prior for tau

\[~\]

Since \(\tau^2\) plays the role of the variance of our observations and \(\tau^2 \sim InvGamma(a, b)\), we compare several combinations of the prior hyperparameters; through simulations we obtained the following behaviours:

As shown above, good prior parameters for \(\tau^2\) are a = 21 and b = 1.7, because they give a smaller uncertainty than the other combinations.
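The simulations behind this choice are not reported; a sketch of how such a prior can be inspected, drawing \(\tau^2 \sim InvGamma(a, b)\) as the reciprocal of a Gamma variate:

```r
set.seed(123)
a <- 21; b <- 1.7
tau2_prior <- 1 / rgamma(10000, shape = a, rate = b)  # InvGamma(a, b) draws
mean(tau2_prior)                       # close to the prior mean b/(a-1) = 0.085
quantile(tau2_prior, c(0.025, 0.975))  # spread of the prior belief
```

Note that the prior mean \(b/(a-1) = 0.085\) sits close to the observed variance 0.075, with little prior spread.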

\[~\]

Choosing a suitable prior for mu

\[~\]

After fixing the hyperparameters of \(\tau^{2}\), it is important to decide which hyperparameter values to assign for the mean of the statistical model; recall that the mean is described as \(\mu_{i} = f(x_{i}) = \alpha - \beta\gamma^{x_{i}}\).

Given that \(\mu_{i}\) depends on the parameters of the non-linear model, we need to choose priors for \(\alpha \, \text{and} \, \beta\). The first hyperparameter to choose, which controls the generated values of \(\alpha\), is the variance \(\sigma_{\alpha}^{2}\); by simulation we have:

Here we assign \(\sigma_{\alpha}^{2}\) the value 6, which shows a lower uncertainty than the alternatives: the peak of the red curve lies near the optimal dashed purple line found. This parameter plays the role of the intercept of the regression; in any case, we only need a reasonable prior belief, which will be adjusted in the subsequent MCMC sampling steps.

For \(\sigma_{\beta}^{2}\), instead, we find by simulation:

In this case the chosen value of \(\sigma_{\beta}^{2}\) is 0.05, which, as shown above, gives the best fit.

Finally, for \(\gamma\) we take a Uniform prior with parameters a = 0 and b = 1.
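A quick joint check of these choices is to draw \((\alpha, \beta, \gamma)\) from the priors and inspect the implied mean curve \(\mu(x) = \alpha - \beta\gamma^{x}\); a sketch (the truncation of the normals to \((1, \infty)\), required by the indicators in the prior, is done here by inverse-CDF sampling; the helper name is ours):

```r
set.seed(1)
# inverse-CDF draw from N(0, s2) truncated to (1, Inf)
rtnorm_lower1 <- function(n, s2) {
  s <- sqrt(s2)
  qnorm(runif(n, pnorm(1, 0, s), 1), 0, s)
}

ndraws <- 1000
alpha_p <- rtnorm_lower1(ndraws, 6)     # sigma_alpha^2 = 6
beta_p  <- rtnorm_lower1(ndraws, 0.05)  # sigma_beta^2  = 0.05
gamma_p <- runif(ndraws)                # Unif(0, 1)

# implied prior mean length at a few ages
for (age in c(1, 10, 30)) {
  mu <- alpha_p - beta_p * gamma_p^age
  cat("age", age, ": prior mean of mu =", round(mean(mu), 3), "\n")
}
```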

\[~\]

1.d) Functional form of all full-conditionals

\[~\]

To derive the full conditionals, it is important to specify how many are required: in this case 4, one for each of our parameters \(\alpha, \beta, \gamma, \tau^{2}\).

Before starting, note that \(\mu_{i}\) is determined by \(\alpha, \beta, \gamma, x_{i}\), where \(x_{i} \in \{x_{1}, ... , x_{n}\}\). So, let's start with \(\tau^{2}\):

\[ \pi(\tau^{2}|\alpha, \beta, \gamma, y_{obs}) \propto \pi(y_{obs}|\alpha, \beta, \gamma, x_{i}, \tau^{2}) \pi(\tau^{2}) \propto \\ \propto \pi(y_{obs}|\mu_{i}, \tau^{2}) \pi(\tau^{2}) \propto \\ \propto (2\pi\tau^{2})^{-\frac{n}{2}} \cdot exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} \Big\} \cdot (\tau^{2})^{-(a+1)} exp\Big\{ -\frac{b}{\tau^{2}} \Big\} \propto \\ \propto (\tau^{2})^{-\frac{n}{2}} \cdot exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} -\frac{b}{\tau^{2}} \Big\} \cdot (\tau^{2})^{-(a+1)} \propto \\ \propto (\tau^{2})^{-(\frac{n}{2}+a+1)} \cdot exp\Big\{ -\frac{\sum_{i=1}^{n}( y_{i} - (\alpha - \beta\gamma^{x_{i}}))^{2}+b}{2\tau^{2}} \Big\} \]

Instead, the other full-conditionals are:

\[ \pi(\alpha|\tau^{2}, \beta, \gamma, y_{obs}) \propto \pi(y_{obs}|\alpha, \beta, \gamma, x_{i}, \tau^{2}) \pi(\alpha) \propto \\ \propto \pi(y_{obs}|\mu_{i}, \tau^{2}) \pi(\alpha) \propto \\ \propto (2\pi\tau^{2})^{-\frac{n}{2}} \cdot exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} \Big\} \cdot exp\Big\{ -\frac{(\alpha)^2}{2\sigma_{\alpha}^{2}} \Big\} \propto \\ \propto exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} -\frac{(\alpha)^2}{2\sigma_{\alpha}^{2}} \Big\} \propto \\ \propto exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n} \Big( y_{i}^{2} + \alpha^{2} + \beta^{2}\gamma^{2x_{i}} - 2y_{i}\alpha + 2y_{i}\beta\gamma^{x_{i}} - 2\alpha\beta\gamma^{x_{i}} \Big) -\frac{(\alpha)^2}{2\sigma_{\alpha}^{2}} \Big\} \propto \\ \propto exp\Big\{ -\frac{1}{2} \Big( \frac{n}{\tau^{2}} + \frac{1}{\sigma_{\alpha}^{2}}\Big)\alpha^{2} + \Big( \frac{\sum_{i=1}^{n} y_{i} + \beta\sum_{i=1}^{n}\gamma^{x_{i}}}{\tau^{2}} \Big)\alpha \Big\} \]

for \(\beta\):

\[ \pi(\beta|\tau^{2}, \alpha, \gamma, y_{obs}) \propto \pi(y_{obs}|\alpha, \beta, \gamma, x_{i}, \tau^{2}) \pi(\beta) \propto \\ \propto \pi(y_{obs}|\mu_{i}, \tau^{2}) \pi(\beta) \propto \\ \propto (2\pi\tau^{2})^{-\frac{n}{2}} \cdot exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} \Big\} \cdot exp\Big\{ -\frac{(\beta)^2}{2\sigma_{\beta}^{2}} \Big\} \propto \\ \propto exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} -\frac{(\beta)^2}{2\sigma_{\beta}^{2}} \Big\} \propto \\ \propto exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n} \Big( y_{i}^{2} + \alpha^{2} + \beta^{2}\gamma^{2x_{i}} - 2y_{i}\alpha + 2y_{i}\beta\gamma^{x_{i}} - 2\alpha\beta\gamma^{x_{i}} \Big) -\frac{(\beta)^2}{2\sigma_{\beta}^{2}} \Big\} \propto \\ \propto exp\Big\{ -\frac{1}{2} \Big( \frac{\sum_{i=1}^{n} \gamma^{2x_{i}}}{\tau^{2}} + \frac{1}{\sigma_{\beta}^{2}}\Big)\beta^{2} + \Big( \frac{ \alpha\sum_{i=1}^{n}\gamma^{x_{i}} - \sum_{i=1}^{n}y_{i}\gamma^{x_{i}} }{\tau^{2}} \Big)\beta \Big\} \]

for \(\gamma\):

\[ \pi(\gamma|\tau^{2}, \alpha, \beta, y_{obs}) \propto \pi(y_{obs}|\alpha, \beta, \gamma, x_{i}, \tau^{2}) \pi(\gamma) \propto \\ \propto \pi(y_{obs}|\mu_{i}, \tau^{2}) \pi(\gamma) \propto \\ \propto (2\pi\tau^{2})^{-\frac{n}{2}} \cdot exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} \Big\} \cdot \mathbb{I_{[0,1]}(\gamma)} \propto \\ \propto exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n}\Big( y_{i} - (\alpha - \beta\gamma^{x_{i}})\Big)^{2} \Big\} \cdot \mathbb{I_{[0,1]}(\gamma)} \propto \\ \propto exp\Big\{ -\frac{1}{2\tau^{2}} \sum_{i=1}^{n} \Big( \beta^{2}\gamma^{2x_{i}} - 2\alpha\beta\gamma^{x_{i}} + 2y_{i}\beta\gamma^{x_{i}} \Big) \Big\} \cdot \mathbb{I_{[0,1]}(\gamma)} \]

As we can see above, some of these full conditionals belong to known distributional families and some do not.

\[~\]

1.e) Which distribution can you recognize within standard parametric families so that direct simulation from full conditional can be easily implemented

\[~\]

As we can see, in the previous section we derived the full conditionals of the four parameters \(\alpha, \beta, \gamma, \tau^{2}\).

We recognize that \(\tau^{2} \sim InvGamma\Big(a+\frac{n}{2}, \,\, b + \frac{\sum_{i=1}^{n}(y_{i}-(\alpha-\beta\gamma^{x_{i}}))^2}{2} \Big)\), and that the full conditionals of \(\alpha\) and \(\beta\) are Normal distributions; \(\gamma\) is the only parameter whose full conditional does not belong to a standard parametric family.

We can now make explicit the parameters of the Normal distributions describing \(\alpha \, \text{and} \, \beta\); for \(\alpha\) we have:

\[ \alpha \sim N\Big( \hat{\mu} = \frac{\frac{\sum_{i=1}^{n} y_{i} + \beta\sum_{i=1}^{n}\gamma^{x_{i}}}{\tau^{2}}}{\frac{n}{\tau^{2}} + \frac{1}{\sigma_{\alpha}^{2}}} , \,\, \hat{\sigma}^{2} = \frac{1}{\frac{n}{\tau^{2}} + \frac{1}{\sigma_{\alpha}^{2}}} \Big) = N\Big(\hat{\mu} = \frac{\sigma_{\alpha}^{2}}{n\sigma_{\alpha}^{2}+\tau^{2}}\sum_{i=1}^{n}\big(y_{i}+\beta\gamma^{x_{i}}\big), \,\, \hat{\sigma}^{2} = \frac{\sigma_{\alpha}^{2}\tau^{2}}{n\sigma_{\alpha}^{2}+\tau^{2}} \Big) \]

and for \(\beta\) we have:

\[ \beta \sim N\Big( \hat{\mu} = \frac{\frac{ \alpha\sum_{i=1}^{n}\gamma^{x_{i}} - \sum_{i=1}^{n}y_{i}\gamma^{x_{i}} }{\tau^{2}}}{\frac{\sum_{i=1}^{n} \gamma^{2x_{i}}}{\tau^{2}} + \frac{1}{\sigma_{\beta}^{2}}} , \,\, \hat{\sigma^{2}} = \frac{1}{\frac{\sum_{i=1}^{n} \gamma^{2x_{i}}}{\tau^{2}} + \frac{1}{\sigma_{\beta}^{2}}}\Big) = N\Big( \hat{\mu} = \frac{\sigma_{\beta}^{2}}{\sigma_{\beta}^{2}\sum_{i=1}^{n}\gamma^{2x_{i}}+\tau^{2}}\sum_{i=1}^{n}\gamma^{x_{i}}(\alpha-y_{i}), \,\, \hat{\sigma^{2}} = \frac{\sigma_{\beta}^{2}\tau^{2}}{\sigma_{\beta}^{2}\sum_{i=1}^{n}\gamma^{2x_{i}}+\tau^{2}} \Big) \]

So, we can directly simulate from the full conditionals of \(\alpha, \beta, \tau^{2}\).

For \(\gamma\), instead, we must proceed in a slightly different way.

\[~\]

1.f) Using a suitable Metropolis-within-Gibbs algorithm simulate a Markov chain (T = 10000) to approximate the posterior distribution for the above model

\[~\]

In this step we use a hybrid-kernel scheme, in this case a Metropolis-within-Gibbs. We simply take Gibbs draws from the known full conditionals of \(\alpha, \beta, \tau^{2}\); for the \(\gamma\) parameter, whose full conditional is not of known form, we handle the unnormalized full conditional with a Metropolis-Hastings step within the Gibbs sampling.

The Gibbs sampler is an algorithm for obtaining draws from a posterior distribution when the conditional posteriors are available in closed form, while Metropolis-Hastings (MH) can be used to sample from a non-conjugate conditional posterior within each Gibbs iteration.

The conditional posteriors of \(\alpha, \beta, \tau^{2}\) are conjugate: inside each Gibbs iteration we can use a standard function to sample from them directly, with no inner sampler needed. For the \(\gamma\) parameter, instead, we need a second sampler, namely MH.

Metropolis-Hastings Algorithm

The goal is to sample from each conditional posterior. The MH sampler works as follows (for this explanation we denote a generic parameter by \(\theta\)):

  1. Define \(\pi\) to be the target density and produce a starting value \(\theta^{(0)}\) to initialize the sampling.

  2. Let \(p_{x}(y)\) be a proposal distribution, where x is the current value of the chain.

  3. Draw a candidate \(Y_{t+1} \sim p_{x}(y)\) and let y denote the realized candidate, \(Y_{t+1} = y\).

  4. Decide whether or not the candidate is accepted as the next state of the chain at time t+1; if not, set the next state equal to the current state x:

\[ X_{t+1} = \begin{cases} y & \text{with probability} \,\, \alpha(x, y) \\ x & \text{with probability} \,\, 1-\alpha(x, y) \end{cases} \]

where \(\alpha(x,y)\) = \(min\Big\{\frac{\pi(y)p_{y}(x)}{\pi(x)p_{x}(y)},1 \Big\}\)

So, the proposals that yield a higher conditional posterior evaluation are always accepted. However, proposals with lower density evaluations are only sometimes accepted. Over many iterations, draws from the posterior’s high density areas are accepted and the sequence of accepted proposals “climbs” to the high density area. Once the sequence arrives in this high-density area, it tends to remain there.

\[~\]

Gibbs sampler Algorithm

Let \(\theta = (\theta_1, ..., \theta_i, ..., \theta_k) \in \Theta \subset \mathbb{R}^k\). Given a target \(\pi(\theta) = \pi(\theta_1, ..., \theta_k)\), the corresponding full conditionals are denoted by:

\[ \pi(\theta_i | \theta_{(i)}) = \pi(\theta_i|\theta_1, \theta_2, ..., \theta_{i-1}, \theta_{i+1}, ..., \theta_k) \,\,\, \text{the chain changes in this way..} \\ \theta_1^{t+1} \sim \pi(\theta_1|\theta_{(1)}) = \pi(\theta_1|\theta_2^t, ..., \theta_k^t) \\ \theta_2^{t+1} \sim \pi(\theta_2|\theta_{(2)}) = \pi(\theta_2|\theta_1^{t+1},\theta_3^t, ..., \theta_k^t) \\ ... \\ \theta_k^{t+1} \sim \pi(\theta_k|\theta_{(k)}) = \pi(\theta_k|\theta_1^{t+1},\theta_2^{t+1}, ..., \theta_{k-1}^{t+1}) \\ \]

We can now iterate and sample the parameters:

\[~\]
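The chunk below calls four helper functions (`fullcond_post_tau`, `fullcond_post_alpha`, `fullcond_post_beta`, `notknown_fullcond_post_gamma`) whose definitions are not shown in the text. A sketch consistent with the full conditionals derived in 1.d and 1.e, with names and signatures inferred from the calls below (the unused current-value arguments are kept only to match those calls), could be:

```r
# Gibbs draw from the full conditional of tau^2: InvGamma(a + n/2, b + SSR/2)
fullcond_post_tau <- function(y_obs, a, b, alpha, beta, gamma, x_i) {
  ssr <- sum((y_obs - (alpha - beta * gamma^x_i))^2)
  1 / rgamma(1, shape = a + length(y_obs) / 2, rate = b + ssr / 2)
}

# Gibbs draw from the normal full conditional of alpha (tau denotes tau^2)
fullcond_post_alpha <- function(y_obs, alpha, beta, gamma, x_ages, tau, sigma_2_alpha) {
  prec <- length(y_obs) / tau + 1 / sigma_2_alpha
  mu   <- (sum(y_obs + beta * gamma^x_ages) / tau) / prec
  rnorm(1, mean = mu, sd = sqrt(1 / prec))
}

# Gibbs draw from the normal full conditional of beta
fullcond_post_beta <- function(y_obs, alpha, beta, gamma, x_ages, tau, sigma_2_beta) {
  prec <- sum(gamma^(2 * x_ages)) / tau + 1 / sigma_2_beta
  mu   <- (sum(gamma^x_ages * (alpha - y_obs)) / tau) / prec
  rnorm(1, mean = mu, sd = sqrt(1 / prec))
}

# unnormalized log full conditional of gamma (the target of the MH step below)
notknown_fullcond_post_gamma <- function(y_obs, alpha, beta, gamma, x_ages, tau) {
  if (gamma < 0 || gamma > 1) return(-Inf)  # support of the Unif(0, 1) prior
  -sum((y_obs - (alpha - beta * gamma^x_ages))^2) / (2 * tau)
}
```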

# MH algorithm to sample from the unknown (unnormalized) full conditional of gamma
notknown_fullcond_post_gamma_MH <- function(y_obs, alpha, beta, currentstate, x_ages, tau, a, greekpi, nsims){ # greekpi is the target distro
  gamma_t <- currentstate # current state of theta in the chain
  acceptance <- 0 # calculate the acceptance rate
  
  for(i in 1:nsims){
    # draw a candidate for the next/future state of the chain 
    noise <- runif(1, min = -a, max = a)
    
    gamma_prop <- gamma_t + noise
    
    # acceptance/rejection auxiliary draw
    r_num <- greekpi(y_obs = y_obs, alpha = alpha, beta = beta, gamma = gamma_prop, x_ages = x_ages, tau = tau)
    r_den <- greekpi(y_obs = y_obs, alpha = alpha, beta = beta, gamma = gamma_t, x_ages = x_ages, tau = tau)
    
    r <- r_num - r_den # log acceptance ratio (the target is evaluated on the log scale)

    # accept/reject step
    omega <- runif(1, min = 0, max = 1)
    ACCEPT <- log(omega) <= min(r, 0)
    
    if(ACCEPT){
      acceptance <- acceptance + 1
      gamma_t <- gamma_prop
    }
  }
  
  return(list(percentage = acceptance/nsims, actual_draw = gamma_t))
}

# set the number of iterations T
niter <- 10000

# set the prior parameters of tau and the prior variances sigma^2 for alpha and beta
a <- 21 
b <- 1.7
sigma_2_a <- 6
sigma_2_b <- 0.05

# define the variables at every iteration t
gibbs_samples <- data.frame(tau = rep(NA, niter), alpha = rep(NA, niter), beta = rep(NA, niter), gamma = rep(NA, niter))

# initialize the state t = 0
gibbs_samples[1, 1:4] <- c(0.03, 1.01, 1.01, 0) # tau must be greater than zero, since it appears in the denominator; alpha and beta are > 1

# save for each iteration the mean of the parameters ( useful for the point 1.h )
mean_at_time_t <- list(mean_tau = rep(NA, niter), mean_alpha = rep(NA, niter), mean_beta = rep(NA, niter), mean_gamma = rep(NA, niter))
# sampling with the Hybrid-Kernels method
for(t in 1:niter){
  # sample from posterior full-conditional of tau
  gibbs_samples[t+1, "tau"] <- fullcond_post_tau(y_obs, a, b, alpha = gibbs_samples[t, "alpha"], beta = gibbs_samples[t, "beta"], gamma = gibbs_samples[t, "gamma"], x_i)
  
  # sample alpha and beta directly from their normal full conditionals (Gibbs steps)
  gibbs_samples[t+1, "alpha"] <- fullcond_post_alpha(y_obs = y_obs, alpha = gibbs_samples[t, "alpha"], beta = gibbs_samples[t, "beta"], gamma = gibbs_samples[t, "gamma"], x_ages = x_i, tau = gibbs_samples[t+1, "tau"], sigma_2_alpha = sigma_2_a)

  gibbs_samples[t+1, "beta"] <- fullcond_post_beta(y_obs = y_obs, alpha = gibbs_samples[t+1, "alpha"], beta = gibbs_samples[t, "beta"], gamma = gibbs_samples[t, "gamma"], x_ages = x_i, tau = gibbs_samples[t+1, "tau"], sigma_2_beta = sigma_2_b)
  
  gibbs_samples[t+1, "gamma"] <- notknown_fullcond_post_gamma_MH(y_obs, alpha = gibbs_samples[t+1, "alpha"], beta = gibbs_samples[t+1, "beta"], currentstate = gibbs_samples[t, "gamma"], x_ages = x_i, tau = gibbs_samples[t+1, "tau"], a = 0.4, greekpi = notknown_fullcond_post_gamma, nsims = 100)$actual_draw # after checking the good tune parameter a
  
  # save for each iteration the mean of the parameters ( useful for the point 1.h )
  mean_at_time_t$mean_alpha[t] <- mean(gibbs_samples[1:t, "alpha"])
  mean_at_time_t$mean_beta[t] <- mean(gibbs_samples[1:t, "beta"])
  mean_at_time_t$mean_gamma[t] <- mean(gibbs_samples[1:t, "gamma"])
  mean_at_time_t$mean_tau[t] <- mean(gibbs_samples[1:t, "tau"])
}

\[~\]

1.g) Show the 4 univariate trace-plots of the simulations of each parameter

\[~\]

In this section we show the trace-plots of the 4 parameters, in order to assess the behaviour of the simulated chains:
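The plotting code behind the four panels is not shown; presumably something like the following helper (ours), applied to the `gibbs_samples` data frame from point 1.f:

```r
# one trace-plot per parameter, on a 2x2 grid
plot_traces <- function(gibbs_samples) {
  par(mfrow = c(2, 2))
  for (p in c("tau", "alpha", "beta", "gamma")) {
    plot(gibbs_samples[[p]], type = "l", col = "orchid2",
         xlab = "iteration t", ylab = p, main = paste("Trace-plot of", p))
  }
  par(mfrow = c(1, 1))
}
```

Called as `plot_traces(gibbs_samples)` after running the sampler.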

\[~\]

Remark . The parameter \(\tau^{2}\) is initialized at a value strictly greater than 0, to avoid division by zero. If we also change the values of a, b, \(\sigma_{\alpha}^{2}\) and \(\sigma_{\beta}^{2}\), the behaviour remains similar.

\[~\]

1.h) Evaluate graphically the behaviour of the empirical averages \(\hat{I}_{t}\) with growing t = 1, …, T

\[~\]

In this section we want to see the empirical average \(\mathbf{\hat{I}}_{t}\) as \(t \, = \, 1,...,T\) grows. Before starting, we write the formula for \(\mathbf{\hat{I}}_{t}\), which is:

\[~\]

\[ \mathbf{I} \approx \mathbf{\hat{I}}_{t} = \frac{1}{t} \sum_{i=1}^{t} h(\theta^{(i)}) \]

\[~\]

It is important to write \(\approx\) because this approximation leverages the SLLN, and we also want to confirm this assumption empirically! So, let's see the results:
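The running means were already stored in `mean_at_time_t` during sampling, so the plots can be produced with a helper like this one (ours), assuming the objects from point 1.f:

```r
# running mean of each parameter as t grows
plot_running_means <- function(mean_at_time_t) {
  par(mfrow = c(2, 2))
  for (p in c("tau", "alpha", "beta", "gamma")) {
    plot(mean_at_time_t[[paste0("mean_", p)]], type = "l", col = "red",
         xlab = "t", ylab = bquote(hat(I)[t]),
         main = paste("Empirical average of", p))
  }
  par(mfrow = c(1, 1))
}
```

Called as `plot_running_means(mean_at_time_t)` after the sampler has run.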

\[~\]

\[~\]

As we can see above, the running mean of each parameter settles down along the Markov chain, which is a very good result.

\[~\]

1.i) Provide estimates for each parameter together with the approximation error and explain how you have evaluated such error

\[~\]

Here are the averages of the simulated values of each parameter, computed after discarding the burn-in period:

\[~\]

## the average of tau at t iterations is: 0.0557626702991902
## the average of alpha at t iterations is: 2.63030776430193
## the average of beta at t iterations is: 1.03338636461978
## the average of gamma at t iterations is: 0.818011804880492

\[~\] We can also visualize the different auto-correlation functions of each hyperparameter:

The ACFs show a good behaviour: the autocorrelation decreases steadily after the first lags, which is a desirable feature.

To estimate the approximation error of our parameter estimates, we start from the formula for the variance of the MCMC estimator:

\[ \mathbf{V}[\hat{I}_{t}] = Var\Big[ \frac{1}{T} \sum_{t=1}^{T} h(X_{t}) \Big] = \frac{1}{T^{2}}\Big[\sum_{t=1}^{T}Var\Big[h(X_{t})\Big] + \sum_{t=1}^{T} \sum_{s \neq t} Cov\Big[h(X_{t}), h(X_{s})\Big]\Big] \]

But since the MC is stationary, the covariance depends only on the lag, so the variance can be written as:

\[ \mathbf{V}[\hat{I}_{t}] = \frac{Var_{\pi}\Big[h(X_{1})\Big]}{T}\Big[ 1 + 2\sum_{k=1}^{T-1}(\frac{T-k}{T})\rho_{k}\Big] \]

Considering the fact that:

\[ Var[\sqrt{T}\hat{I}_{t}] = T\sigma_{\hat{I}_{t}}^{2} \rightarrow \sigma^{2} \Big( 1 + 2\sum_{k=1}^{\infty} \rho_{k} \Big) \]

The factor inside the parentheses is greater than 1 whenever positive autocorrelations \(\rho_{k}\) prevail: it is an inefficiency factor, and it is used to compute the effective sample size (ESS):

\[ T_{eff} = \frac{T}{1+2\sum_{k=1}^{\infty}\rho_{k}} \]

This gives the effective number of samples actually available for estimation; we use it in the variance formula in this way:

\[ \mathbf{V}[\hat{I}_{t}] = \frac{Var_{\pi}[h(X_{1})]}{T_{eff}} = \Big( 1 + 2 \sum_{k=1}^{\infty} \rho_{k}\Big)\frac{\sigma^{2}}{T} \]
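In practice \(T_{eff}\) can be estimated from the sample ACF; a base-R sketch (ours), truncating the sum at the first non-positive autocorrelation, which is a common heuristic:

```r
# effective sample size: T / (1 + 2 * sum of the initial positive autocorrelations)
ess <- function(x, max_lag = 100) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]            # first non-positive autocorrelation
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  length(x) / (1 + 2 * sum(rho))
}
```

For a strongly autocorrelated chain the ESS is much smaller than T, which inflates the MCMC variance accordingly (e.g. `ess(gibbs_samples$alpha)`).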

For the variance of each parameter we obtain:

\[~\]

## the variance of tau at t iterations is: 1.32965797874964e-08
## the variance of alpha at t iterations is: 3.5818080995536e-05
## the variance of beta at t iterations is: 1.24893579491673e-07
## the variance of gamma at t iterations is: 1.00119586358156e-05

\[~\]

The parameter with the largest MCMC variance, i.e. the largest approximation error, is \(\alpha\).

Remark . We can also consider a specific tool to better estimate the approximation error, taking into account possible positive autocorrelations between the samples drawn before: the Monte Carlo Standard Error (MCSE), an estimate of the inaccuracy of the MC approximation of a posterior expectation. The MCSE is the standard deviation of the estimated posterior mean, accounting for the uncertainty introduced by the MC or MCMC algorithm.
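A self-contained way to compute such an MCSE is the batch-means estimator; a sketch (ours), applicable for instance to the \(\alpha\) chain via `mcse_bm(gibbs_samples$alpha)`:

```r
# batch-means MCSE: split the chain into batches of length ~sqrt(T);
# batch means are nearly independent, so Var(mean) ~ b * var(batch means) / T
mcse_bm <- function(x) {
  n <- length(x)
  b <- floor(sqrt(n))   # batch length
  a <- floor(n / b)     # number of batches
  bm <- sapply(seq_len(a), function(k) mean(x[((k - 1) * b + 1):(k * b)]))
  sqrt(b * var(bm) / n)
}
```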

\[~\]

## the MCSE of tau at t iterations is: 0.000116206819320402
## the MCSE of alpha at t iterations is: 0.00846300958434365
## the MCSE of beta at t iterations is: 0.000365275310456165
## the MCSE of gamma at t iterations is: 0.00480012025862532

\[~\]

We can say again that \(\alpha\) has the largest MCSE, i.e. the highest approximation error with respect to the other parameters.

\[~\]

1.l) Which parameter has the largest posterior uncertainty? How did you measure it?

\[~\] The posterior uncertainty of each parameter is assessed here through its 95% posterior (credible) interval; the parameter with the widest interval has the largest posterior uncertainty. So for \(\tau^2\):

\[~\]

conf_tau_t <- quantile(gibbs_samples[,"tau"], prob=c(0.025,0.975))
plot(density(gibbs_samples[,"tau"]), xlab = expression(tau^2), ylab = "Density", xlim = c(0.03, 0.11), lwd = 2, col = "red", main = "Confidence interval for tau")
abline(v = conf_tau_t[1], lty=2)
abline(v = conf_tau_t[2], lty=2)

cat(paste("lower side - ", conf_tau_t[1], "\nupper side - ", conf_tau_t[2], "\nwith difference: ", conf_tau_t[2] - conf_tau_t[1],"\n", sep=""))
## lower side - 0.0397492813270719
## upper side - 0.0781544249231553
## with difference: 0.0384051435960834

\[~\]

For \(\alpha\):

\[~\]

conf_alpha_t <- quantile(gibbs_samples[,"alpha"], prob=c(0.025,0.975))
plot(density(gibbs_samples[,"alpha"]), xlab = expression(alpha), ylab = "Density", xlim = c(2.20, 3.10), lwd = 2, col = "red", main = "Confidence interval for alpha")
abline(v = conf_alpha_t[1], lty=2)
abline(v = conf_alpha_t[2], lty=2)

cat(paste("lower side - ", conf_alpha_t[1], "\nupper side - ", conf_alpha_t[2], "\nwith difference: ", conf_alpha_t[2] - conf_alpha_t[1],"\n", sep=""))
## lower side - 2.40979026321977
## upper side - 2.94132254678438
## with difference: 0.531532283564617

\[~\]

For \(\beta\):

\[~\]

conf_beta_t <- quantile(gibbs_samples[,"beta"], prob=c(0.025,0.975))
plot(density(gibbs_samples[,"beta"]), xlab = expression(beta), ylab = "Density", xlim = c(0.98, 1.20), lwd = 2, col = "red", main = "Confidence interval for beta")
abline(v = conf_beta_t[1], lty=2)
abline(v = conf_beta_t[2], lty=2)

cat(paste("lower side - ", conf_beta_t[1], "\nupper side - ", conf_beta_t[2], "\nwith difference: ", conf_beta_t[2] - conf_beta_t[1],"\n", sep=""))
## lower side - 1.00074886696844
## upper side - 1.12019770080414
## with difference: 0.119448833835698

\[~\] For \(\gamma\):

\[~\]

conf_gamma_t <- quantile(gibbs_samples[,"gamma"], prob=c(0.025,0.975))
plot(density(gibbs_samples[,"gamma"]), xlab = expression(gamma), ylab = "Density", xlim = c(0.35, 1), lwd = 2, col = "red", main = "Confidence interval for gamma")
abline(v = conf_gamma_t[1], lty=2)
abline(v = conf_gamma_t[2], lty=2)

cat(paste("lower side - ", conf_gamma_t[1], "\nupper side - ", conf_gamma_t[2], "\nwith difference: ", conf_gamma_t[2] - conf_gamma_t[1],"\n", sep=""))
## lower side - 0.609962555952337
## upper side - 0.94298148024819
## with difference: 0.333018924295854

\[~\]

We can confirm that \(\alpha\) has the largest posterior uncertainty, since its 95% interval is by far the widest (0.53, against at most 0.33 for the other parameters).

\[~\]

1.m) Which couple of parameters has the largest correlation (in absolute value)?

\[~\]

Representing the correlations of the sampled values with a heatmap, we see:

\[~\]

\[~\]

The largest value of correlation is between \(\alpha\) and \(\gamma\).

\[~\]

1.n) Use the Markov chain to approximate the posterior predictive distribution of the length of a dugong with age of 20 years

\[~\]

Posterior predictive distributions are used when we want to fill in missing data or make predictions about future observations. In our case, with the MCMC output at hand, it is straightforward to approximate them.

Suppose we have a model \(\pi(y_{pred}|\theta)\) and a fully specified prior distribution. The posterior predictive distribution of a new (future) quantity is:

\[ \pi(y_{pred}|y_{obs}) = \int \pi(y_{pred}|\theta)\pi(\theta|y_{obs})d\theta \]

where \(y_{pred}\) is treated as an unknown quantity within the model. This integral is naturally approximated with MCMC draws from the posterior distribution \(\pi(\theta|y_{obs})\).

To approximate the posterior predictive distribution using MCMC we simply follow these steps:

  1. Remove the burn-in period, where the values have not yet stabilized
  2. Generate the mean \(\mu = \alpha - \beta\gamma^{20}\) for each retained draw (the age of the dugong is 20 in this first step)
  3. Generate the corresponding new observations
  4. Approximate the posterior predictive distribution, reporting the Monte Carlo standard error as the approximation error
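These steps translate into a simple composition-sampling helper; a sketch (ours), assuming `gibbs_samples` from point 1.f, where the `tau` column stores the variance \(\tau^2\):

```r
# posterior predictive draws for a new dugong of a given age
posterior_predictive <- function(gibbs_samples, x_new, burn_in = 1000) {
  post <- gibbs_samples[-(1:burn_in), ]               # 1. drop the burn-in
  mu_new <- post$alpha - post$beta * post$gamma^x_new # 2. mean under each draw
  rnorm(nrow(post), mean = mu_new, sd = sqrt(post$tau)) # 3. new observations
}
```

Then, e.g., `y_pred <- posterior_predictive(gibbs_samples, x_new = 20)`, with `mean(y_pred)` as the predictive estimate (step 4).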

\[~\]

## The approximation of the posterior predictive distribution is: 2.566
## The MCSE of the posterior predictive distribution is: 0.00321211317675067

\[~\]

1.o) Provide the prediction of a different dugong with age 30

\[~\]

Considering the same steps followed in the previous section, we have:

\[~\]

## The approximation of the posterior predictive distribution is: 2.606
## The MCSE of the posterior predictive distribution is: 0.00563841325676896

\[~\]

1.p) Which prediction is less precise?

\[~\] The second prediction (age 30) is less precise, since its MCSE is larger: age 30 lies further from the bulk of the observed ages, so the prediction carries more uncertainty.

\[~\]

Part 2

\[~\]

Consider the Markov chain \((X_{t})_{t \geqslant 0}\) defined on the state space \(\mathbb{S} = \{1, 2, 3\}\), with the transition probability matrix (TPM) that defines the transition probabilities between our states:

\[ TPM = \begin{bmatrix} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{5}{8} & \frac{1}{8} & \frac{1}{4} \\ \frac{2}{3} & \frac{1}{3} & 0 \end{bmatrix} \]

\[~\]

2.a) Starting at time t = 0 in the state X_0 = 1 simulate the Markov chain with distribution assigned as above for t = 1000 consecutive times

\[~\]
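The state space `S` and the matrix `tpm` used below are not defined in the text; they follow directly from the TPM given above:

```r
# state space and transition probability matrix of the chain
S   <- c(1, 2, 3)
tpm <- matrix(c(0,   1/2, 1/2,
                5/8, 1/8, 1/4,
                2/3, 1/3, 0),
              nrow = 3, byrow = TRUE)
rowSums(tpm)  # each row must sum to 1
```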

# defining the number of iterations and the starting point
niter <- 1000
current_state <- 1 

# allocate the array where the simulations are stored
nsim <- rep(NA, niter+1)

# initialize the state
nsim[1] <- current_state

for(t in 1:niter){
  nsim[t+1]<-sample(S, size=1, prob= tpm[nsim[t],])
}

cat("We counted the absolute frequencies of the three states \n\n")
## We counted the absolute frequencies of the three states
table(nsim)
## nsim
##   1   2   3 
## 393 340 268

\[~\]

2.b) compute the empirical relative frequency of the two states in your simulation

\[~\]

prop.table(table(nsim))
## nsim
##         1         2         3 
## 0.3926074 0.3396603 0.2677323

\[~\]

2.c) repeat the simulation for 500 times and record only the final state at time t = 1000 for each of the 500 simulated chains. Compute the relative frequency of the 500 final states. What distribution are you approximating in this way? Try to formalize the difference between this point and the previous point.

\[~\]

As stated in the title, we simulate 500 chains of 1000 steps each, saving for each run the final state and the visited states, in order to carry out further inference on them. So:

\[~\]

markovchainsampling <- function(current_state, tpm, niter = 1000){
  # allocates the array where stores the simulations
  nsim <- rep(NA, niter+1)
  
  # initialize the state
  nsim[1] <- current_state
  
  for(t in 1:niter){
    nsim[t+1]<-sample(S, size=1, prob= tpm[nsim[t],])
  }
  
  return(list(endstate = tail(nsim, n = 1), relativefrequencies = nsim))
}

# number of chains
nchains <- 500

# declares the main variables where to store the values
final_state <- rep(NA, nchains)
final_relative_frequencies <- matrix(NA, ncol=niter+1, nrow=500)

for(xT in 1:nchains){
  out <- markovchainsampling(current_state = 1, tpm = tpm)
  
  final_state[xT] <- out$endstate
  final_relative_frequencies[xT, ] <- out$relativefrequencies
}

\[~\]

From the simulated chains we can now show the counts and relative frequencies of the 500 final states:

\[~\]

## final_state
##   1   2   3 
## 182 162 156
## final_state
##     1     2     3 
## 0.364 0.324 0.312

\[~\]

As we can see above, these are the absolute and relative frequencies of the 500 final states. The behaviour is slightly different from point (b) because here we look only at the final state of each of 500 independent 1000-step chains: in this way we are approximating the marginal distribution of \(X_{1000}\), which for large t approaches the steady state (also known as the stationary distribution).

We can say that the Markov chain is ergodic, which implies:

  • irreducibility: there exists a sequence of transitions with non-zero probability from any state to any other
  • aperiodicity: the states are not partitioned into sets such that transitions occur cyclically from one set to another

Now, we make some inference on the relative frequencies of the three states recorded along each simulated chain:

\[~\]

## The mean relative frequency for the state 1 is exactly: 0.392033966033966

## The mean relative frequency for the state 2 is exactly: 0.329224775224775

## The mean relative frequency for the state 3 is exactly: 0.278741258741259

\[~\]

These empirical frequencies are consistent with the invariant (stationary) distribution of the chain in the discrete case, which we derive in the next point.

\[~\]

2.d) compute the theoretical stationary distribution and explain how you have obtained it

\[~\] Considering the transition probability matrix (TPM) with the three states mentioned before:

\[ TPM = P = \begin{bmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{bmatrix} = \begin{bmatrix} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{5}{8} & \frac{1}{8} & \frac{1}{4} \\ \frac{2}{3} & \frac{1}{3} & 0 \end{bmatrix} \]

This matrix tells us with what probability the chain moves from one state to another; for example, the probability of moving from state 1 to state 2 is \(\frac{1}{2}\), and so on.

The vector of probabilities \(\pi = [\pi_1, \pi_2, \pi_3]\) is related to the steady state, with \(\sum_{j=1}^{d} \pi_j = 1\) and \(\pi_j \geqslant 0 \,\, \forall j\); so if we have \(\pi_1 = 0.5, \pi_2 = 0.5 \,\text{and}\, \pi_3 = 0\), it means that the chain is in state 1 or state 2 with those corresponding probabilities.

The goal is to find a vector of probabilities that stays unchanged as the chain evolves, so that the distribution is preserved over time.

The stationary distribution \(\pi = [\pi_1, \pi_2, \pi_3]^{T}\) must satisfy the equations:

\[\begin{cases} \pi_1p_{11} + \pi_2p_{21} + \pi_3p_{31} = \pi_1 \\ \pi_1p_{12} + \pi_2p_{22} + \pi_3p_{32} = \pi_2 \\ \pi_1p_{13} + \pi_2p_{23} + \pi_3p_{33} = \pi_3 \end{cases}\]

this can be rewritten as:

\[\begin{equation} \pi^T P = \pi^T \\ \begin{bmatrix} \pi_1 & \pi_2 & \pi_3 \end{bmatrix} \begin{bmatrix} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{5}{8} & \frac{1}{8} & \frac{1}{4} \\ \frac{2}{3} & \frac{1}{3} & 0 \end{bmatrix} = \begin{bmatrix} \pi_1 & \pi_2 & \pi_3 \end{bmatrix} \end{equation}\]

that is equivalent to:

\[ (P^T-I)\pi = 0 \]

On the left-hand side there is an implicit coefficient \(\lambda = 1\), which in the general equation plays the role of an eigenvalue of the matrix, when it exists. Since \(\pi\) must be one of the possible solutions of the system of equations:

\[ (P^T-\lambda I)\pi = 0 \]

the vector \(\pi = [\pi_1, ..., \pi_d]\) is also the eigenvector associated with \(\lambda = 1\). The condition \(\sum_{j=1}^{d} \pi_j = 1\) with \(\pi_j \geqslant 0 \,\, \forall j\) is of fundamental importance for \(\pi\) to be a probability distribution.

Summarizing:

  1. Find the eigenvector of \(P^T\) with eigenvalue equal to 1
  2. Check that the entries of the (normalized) eigenvector are non-negative
  3. Verify that \(\pi\) satisfies the stationarity equation of the Markov chain transition probability matrix

\[~\]

point 1

\[~\]

We must renormalize the eigenvector corresponding to the eigenvalue equal to 1 as follows:

\[~\]

vector_pi <- eigen(t(tpm))$vector[,1]/sum(eigen(t(tpm))$vector[,1])
vector_pi
## [1] 0.3917526 0.3298969 0.2783505

As shown above, `vector_pi` is the \(\pi\) that satisfies the stationarity equations for our Markov chain with transition probability matrix P.

\[~\]

point 2

\[~\]

eigen(t(tpm))$vector[,1]
## [1] 0.6720665 0.5659508 0.4775210

\[~\]

point 3

\[~\]

t(tpm)%*%vector_pi
##           [,1]
## [1,] 0.3917526
## [2,] 0.3298969
## [3,] 0.2783505

\[~\] In general a Markov chain may admit several stationary distributions, but since our chain is finite and irreducible, this stationary distribution is unique.

2.e) is it well approximated by the simulated empirical relative frequencies computed in (b) and (c)?

\[~\] As shown above, the theoretical steady state \(\pi\) is very close to the empirical relative frequencies computed in points (b) and (c):

## [1] 0.3917526 0.3298969 0.2783505